[7/n] [sync-switch] add single-transaction switch config read (not yet wired up)#10648
Conversation
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
| // Every switch port. The bootstore config is rack-global, | ||
| // so this is intentionally not filtered by rack. | ||
| let ports = load_all_switch_ports(&conn).await?; |
There was a problem hiding this comment.
I don't think this comment is correct w.r.t. the multirack plans (cc @andrewjstone to confirm/deny). IIRC each rack will have its own bootstore, since it's tied to the bootstrap network / trust quorum for replication?
I'm also not sure this is particularly actionable given the stack of work and current shape of the bg task. Maybe just worth rewording this comment and/or filing an issue on the multirack board for figuring out what to do here moving forward?
There was a problem hiding this comment.
Yeah, Andrew and I talked about it in person last week and I need to update this comment. (Yeah, each rack will have its own bootstore.)
Created using spr 1.3.6-beta.1
| .await?; | ||
| set_entry.insert(announcements); | ||
| } | ||
| let announcements = announce_cache[&set_id].clone(); |
There was a problem hiding this comment.
This can't panic because we always insert something if this set_id was vacant, but it still makes me nervous; could we shuffle this around so we don't use the indexing operator without losing any clarity? Untested but something like
let announcements = match announce_cache.entry(set_id) {
hash_map::Entry::Vacant(set_entry) => {
let announcements =
announce_dsl::bgp_announcement
.filter(
announce_dsl::announce_set_id
.eq(set_id),
)
.select(BgpAnnouncement::as_select())
.load_async::<BgpAnnouncement>(&conn)
.await?;
set_entry.insert(announcements).clone()
}
hash_map::Entry::Occupied(set_entry) => {
set_entry.get().clone()
}
};There was a problem hiding this comment.
Yeah, good idea.
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Part two of the fix for #10640. This wires up what was introduced in the previous commit, #10648, dropping the piecemeal reads. We aren't totally out of the woods yet, because the transactional read currently doesn't handle transient db issues as well as it could. We're going to fix that in a principled fashion in upcoming commits. Note that I haven't touched the mgd and other parts of the switch config in this commit, since @jgallagher is actively working on them. (Let me know if I should also include those changes either here or in a followup.) Depends on: * #10642 * #10643 * #10644 * #10645 * #10646 * #10647 * #10648
… silently dropping items (#10650) Previously, on encountering a transient DB issue, we would silently drop items from the bootstore config. That seems quite obviously bad and was part of the reason #10640 escalated a crdb bug into persistent bootstore corruption. Rework all this by: * collecting all the problems that occurred * returning a new error type if there were any problems **This is a functional change that is worth thinking very hard about.** Many but not all of the invariants here are maintained by the database layer. If there's persistent db corruption or a similar issue, we will no longer generate a bootstore config for the rack! I think this is overall the right thing to do, but it is quite possible there are latent issues that were being hidden by the looser error handling that this code previously had. The report is now logged by the background task, unless the task exits with one of the other early errors. (There's a much larger rework of the task that needs to be done, but I'm not doing that here since it'll collide with a lot of what @jgallagher is currently doing.) Depends on: * #10642 * #10643 * #10644 * #10645 * #10646 * #10647 * #10648 * #10649
…rts newtype (#10651) In prior commits in the stack, we made it so that transient DB errors don't result in a partially-generated config. This change encodes one specific form of that (non-empty ports) into the type system. Making bootstore deserialization fallible is an improvement in this case, because it turns a panic into a loop. Also include a change to make the rack ID in `StartSledAgentRequest` a typed `RackUuid`. This isn't directly related but is spiritually similar and I'm touching this code anyway. Depends on: * #10642 * #10643 * #10644 * #10645 * #10646 * #10647 * #10648 * #10649 * #10650
This is part one of the Nexus-side fix for #10640 -- it adds a way to read the entire switch config in a single transaction.
This isn't actually used yet -- the next commit in this series will start using it.
Depends on: